-
Notifications
You must be signed in to change notification settings - Fork 320
Implement the new tuning API for DeviceTransform
#6914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement the new tuning API for DeviceTransform
#6914
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
4244463 to
43feb21
Compare
fca1221 to
2aade5f
Compare
This comment has been minimized.
This comment has been minimized.
2aade5f to
57cc332
Compare
| #if _CCCL_HAS_CONCEPTS() | ||
| requires transform_policy_hub<ArchPolicies> | ||
| #endif // _CCCL_HAS_CONCEPTS() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body. And using a static_assert would only be evaluated in the device path.
How would I write that using concept emulation and have the concept check before the __launch_bounds__?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could write:
| #if _CCCL_HAS_CONCEPTS() | |
| requires transform_policy_hub<ArchPolicies> | |
| #endif // _CCCL_HAS_CONCEPTS() | |
| _CCCL_TEMPLATE(typename PolicySelector, | |
| typename Offset, | |
| typename Predicate, | |
| typename F, | |
| typename RandomAccessIteratorOut, | |
| typename... RandomAccessIteratorsIn) | |
| _CCCL_REQUIRES(transform_policy_selector<PolicySelector>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but as discussed on Slack before, we would need to get transform_policy_selector and then policy_selector working, which we couldn't of the is_constant_expression check. Let's leave it.
| bool all_inputs_contiguous = true; | ||
| bool all_input_values_trivially_reloc = true; | ||
| bool can_memcpy_contiguous_inputs = true; | ||
| bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size); | ||
| for (const auto& input : inputs) | ||
| { | ||
| all_inputs_contiguous &= input.is_contiguous; | ||
| all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable; | ||
| // the vectorized kernel supports mixing contiguous and non-contiguous iterators | ||
| can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable; | ||
| all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: While it is technically more efficient, I believe it would improve readability if we did
const bool all_inputs_contiguous = ::cuda::std::all_of(input.begin(), input.end(), [](const auto& input) { return input.is_contiguous; })There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I do this later? Maybe we have std::ranges::all_of by then.
This comment has been minimized.
This comment has been minimized.
cb0fac5 to
1d14a3e
Compare
This comment has been minimized.
This comment has been minimized.
1d14a3e to
a661d8f
Compare
|
I see tiny changes in the generated SASS for The fill lernel for All kernels with a functor marked as It feels a bit like the items per thread changed for the fill kernels. |
They did. Before we had a tuning policy for sm_120, that was not taken into account :D This PR now uses it. |
|
I disabled the sm120 fill policy and now the only SASS diff for filling is on: which is a |
|
Found the final issue with the fill kernels. Disabled the vectorized tunings when we have input streams (they were tuned for output only use cases). SASS of |
1139c44 to
c8b2ef6
Compare
51c844f to
2ba9470
Compare
gevtushenko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excited to see the new tuning machinery at work! Code is much more readable now and we no longer have to parse PTX 🎉
This comment has been minimized.
This comment has been minimized.
|
I see EVO correctly failing to build invalid parameter configurations: What's a bit worrying is that in the course of 4h and across 8 GPUs not a single algorithm other than 1 succeeded in building. This may suggest we need a different approach, but we can address this later. |
This comment has been minimized.
This comment has been minimized.
🥳 CI Workflow Results🟩 Finished in 7h 49m: Pass: 100%/133 | Total: 1d 20h | Max: 4h 29m | Hits: 98%/177904See results here. |
PR NVIDIA#6914 seems to have missed a few pieces to remove
PR #6914 seems to have missed a few pieces to remove
Fixes: #6919
Fixes: #5057
Fixes: #3017
cuda::__all_arch_idsandcuda::__is_specific_arch#6916arch_policyrelated topolicy_selectorcub.bench.transform.babelstream.baseon SM75;80;90;100;120cub.test.device.transform.lid_0on SM75;80;90;100;120__CUDA_ARCH_LIST__.)value_typeof CCCL.C output iterators fromvoidto the type they write to. See also: [BUG] CCCL.C output iteratorcccl_iterator_t::value_type::sizedoes not match JIT-compiledsizeof(iterator::value_type)#7169Compile time of
cub.test.device.transform.lid_0using nvcc 13.0 and clang 20 for sm86, sm120branch:
1m49.900s
1m50.615s
1m50.255s
main:
1m56.917s
1m57.378s
1m59.371s
Compile time of
cub.test.device.transform.lid_0for sm86, sm120 using clang 20 in CUDA mode:branch:
real 1m40.627s
real 1m40.675s
real 1m40.912s
main:
real 1m39.273s
real 1m39.669s
real 1m39.835s